fix: scope large binary storage and cleanup by execution id#5280
Open
kunwp1 wants to merge 14 commits into
Open
fix: scope large binary storage and cleanup by execution id#5280kunwp1 wants to merge 14 commits into
kunwp1 wants to merge 14 commits into
Conversation
Also update existing call site in RegionExecutionCoordinator to pass None for the new field (required because ScalaPB no_default_values_in_constructor is true).
…he#4123) betterproto returns an empty (falsy) ExecutionIdentity for an unset executionId field rather than None, so the previous `is not None` check never triggered and an unset id would silently produce objects/0/... Use truthiness so unset -> None -> create() raises, matching the JVM invariant. Also moves a stray mid-file `import re` to the top.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5280 +/- ##
============================================
- Coverage 49.12% 49.08% -0.04%
+ Complexity 2378 2375 -3
============================================
Files 1051 1050 -1
Lines 40348 40304 -44
Branches 4279 4267 -12
============================================
- Hits 19821 19784 -37
+ Misses 19368 19358 -10
- Partials 1159 1162 +3
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6a0709e to
94c2804
Compare
…apache#4123) Move the per-execution id out of StorageConfig (which holds only static system configuration sourced from storage.conf) into a dedicated module-level holder in large_binary_manager (set_current_execution_id), mirroring the JVM LargeBinaryManager. The Python init handler sets it via that API.
Add get_current_execution_id() and route create() and the tests through it instead of reading the module-level _current_execution_id directly, keeping the holder's access encapsulated.
Contributor
Author
|
/request-review @Xiao-zhen-Liu Can you review this PR because you are an engine expert? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this PR?
Large binaries were stored in the shared
texera-large-binariesbucket under flat keysobjects/{timestamp}/{uuid}with no execution id, andclearExecutionResources(eid)deleted all of them viaLargeBinaryManager.deleteAllObjects(). Any cleanup for one execution therefore erased every other execution's (and user's) large binaries.This PR namespaces every large binary by its execution id and scopes deletion:
objects/{eid}/{uuid}on both the JVM and Python workers.InitializeExecutorRequest.executionIdproto field, injected by the system at executor init. The user-facinglargebinary()/new LargeBinary()APIs are unchanged.LargeBinaryManager.deleteByExecution(eid)(prefix delete ofobjects/{eid}/). Both JVM and Python engines share the bucket and key shape, so this single JVM-side delete removes binaries created by both.deleteAllObjects()is removed.Pre-existing objects under the old
objects/{timestamp}/...scheme are left untouched.Any related issues, documentation, discussions?
Closes #4123.
How was this PR tested?
Requires running
./bin/python-proto-gen.shImport the following json file to create two workflows, run them, and check if each execution creates 6 objects and one execution doesn't remove the other execution's large binary objects.
Large.Binary.Python (1).json
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), models Claude Opus 4.7 and Claude Sonnet 4.6